Abstract



This paper studies the problem of ergodicity of transition probability matrices in Markovian
models, such as hidden Markov models (HMMs), and how it makes the task of learning to
represent long-term context for sequential data very difficult. This phenomenon hurts
the forward propagation of long-term context information, as well as learning a hidden
state representation to represent long-term context, which depends on propagating credit
information backwards in time. Using results from Markov chain theory, we show that
this problem of diffusion of context and credit is reduced when the transition probabilities
approach 0 or 1, i.e., the transition probability matrices are sparse and the model essentially
deterministic. The results found in this paper apply to learning approaches based on
continuous optimization, such as gradient descent and the Baum-Welch algorithm.

1. Introduction 

Problems of learning on temporal domains can be significantly hindered by the presence 
of long-term dependencies in the training data. A sequence of random variables (e.g.,
a sequence of observations $\{y_1, y_2, \ldots, y_t, \ldots, y_T\}$, denoted $y_1^T$) is said to exhibit long-term
dependencies if the variables $y_t$ at a given time $t$ are significantly dependent on the variables
$y_{t_0}$ at much earlier times $t_0 \ll t$. In these cases, a system trained on this data (e.g., to
model its distribution, or make classifications or predictions) has to be able to store for
arbitrarily long durations bits of information in its state variable, called $x_t$ here. In general,
the difficulty is not only to represent these long-term dependencies, but also to learn a 
representation of past context which takes them into account. Recurrent neural networks 
(Rumelhart, Hinton, & Williams, 1986; Williams & Zipser, 1989), for example, have an 
internal state and a rich expressive power that provide them with the necessary long-term 
memory capabilities. 

Algorithms that could efficiently learn to represent long-term context would be useful in 
many areas of Artificial Intelligence. For example, they could be applied to many problems 
in natural language processing, both at the symbolic level (e.g., learning grammars and 
language models), and subsymbolic level (e.g., modeling prosody for speech recognition or 
synthesis). 

In order to train the learning system, however, an effective mechanism of credit assign- 
ment through time is needed. To change the parameters of the system in order to change 
the internal state of the system at time t, so as to "improve" the internal state of the system
later in the sequence, one can recursively propagate credit or error information backwards
in time. For example, the Baum-Welch algorithm for HMMs (Baum, Petrie, Soules, & 
Weiss, 1970; Levinson, Rabiner, & Sondhi, 1983) and the back-propagation through time 
algorithm for recurrent neural networks (Rumelhart et al., 1986) rely on this kind of
recursion. Numerous gradient-descent based algorithms have been proposed for solving the
credit assignment problems in recurrent networks (e.g., Rumelhart et al., 1986; Williams & 
Zipser, 1989). Yet, many researchers have found practical difficulties in training recurrent 
networks to perform tasks in which the temporal contingencies present in the input/output 
sequences span long intervals (Bengio, Simard, & Frasconi, 1994; Mozer, 1992; Rohwer, 
1994). Bengio et al. (1994) have also found theoretical reasons for this difficulty and proved
a negative result for parametric dynamical systems with a non-linear state to next-state
recurrence$^1$ $x_t = f_t(x_{t-1})$: it will be increasingly difficult to train such a system with
gradient descent as the duration of the dependencies to be captured increases. Let $J$ be
the matrix of partial derivatives of the state to next-state function, $J_{ij} = \partial x_{t,i} / \partial x_{t-1,j}$.
A mathematical analysis of the problem shows that, depending on the norm $|J|$ of the Jacobian
matrix $J$, one of two conditions arises in such systems. When $|J| < 1$, the dynamics of
the network allow it to reliably store bits of information for arbitrary durations, even with
bounded input noise; however, gradients with respect to an error at a given time step vanish
exponentially fast as one propagates them backward in time. On the other hand, when
$|J| > 1$, gradients can flow backward, but the system is locally unstable and cannot reliably
store bits of information for a long time. Bengio et al. (1994) showed how this hurts the 
learning of long-term dependencies by putting exponentially more weight on the influence 
of short-term dependencies (in comparison to long-term dependencies) over the gradient 
of a cost function with respect to trainable parameters. The above negative result applies 
to non-linear parameterized dynamical systems such as most recurrent networks, but not 
to linear probabilistic models such as hidden Markov models (HMMs). These models are
a special case of our previous result in which the $\infty$-norm $|J| = 1$, because this matrix is
a stochastic matrix, i.e., a matrix $A$ of transition probabilities $A_{ij} = P(x_t = j \mid x_{t-1} = i)$,
where the state variable $x_t$ can take a finite number of values.

The main contribution of this paper is therefore an extension of the negative results 
found by Bengio et al. (1994) to the case of Markovian models, which include standard 
HMMs (Baum et al., 1970; Levinson et al., 1983) as well as variations of HMMs such as 
Input/Output HMMs (IOHMMs) (Bengio & Frasconi, 1995b), and Partially Observable
Markov Decision Processes (POMDPs) (Sondik, 1973, 1978; Chrisman, 1992). We find 
that in general, a phenomenon of diffusion of context and credit assignment, due to the 
ergodicity of the transition probability matrices, hampers both the representation and the 
learning of long-term context in the hidden state variable. 

Both homogeneous and non-homogeneous Markovian models are considered. Homoge- 
neous here means that the transition probabilities of the Markov model are constant over 
time t. Non-homogeneous means that these transition probabilities are allowed to be dif- 
ferent for each time step, e.g., as a function of an external input that may be different at 
each time step. In the homogeneous case (e.g., standard HMMs), such models can learn the
distribution $P(y_1^T)$ of output sequences $y_1^T = y_1, y_2, \ldots, y_T$ by associating an output
distribution $P(y_t \mid x_t = i)$ to each value $i$ of the discrete state variable $x_t$. In the non-homogeneous
case, transition and output distributions are conditional on an input sequence, which allows
modeling relationships between input and output sequences. In the case of IOHMMs (Bengio
& Frasconi, 1995b), one thus learns a model $P(y_1^T \mid u_1^T)$ of the conditional distribution of
an output sequence $y_1^T$ when an input sequence $u_1^T$ is given. This can be used to perform
sequence regression or classification, as with recurrent networks. In the case of POMDPs
(Sondik, 1973, 1978; Chrisman, 1992), used to control a process with a hidden state, one
wants not only to build such a model, but also to select a proper sequence $a_1^T$ of (discrete)
actions in order to maximize a discounted sum of future rewards that depends on the action
taken, the observed output sequence $y_1^T$ and the estimated distribution of the state
trajectory. Note that the sequence of actions $a_1^T$ in POMDPs and the sequence of inputs
$u_1^T$ in IOHMMs play a similar role in this paper, inasmuch as both are responsible for the
non-homogeneity of the Markov chain. In the following, we shall use the same symbol $u_1^T$
to denote the sequence that controls transition probabilities, i.e., inputs for IOHMMs and
actions for POMDPs.

1. For example, in the case of a recurrent neural network with recurrent weight matrix $W$ and input vector
$u_t$ at time $t$, the next-state recurrence is $f_t(x_{t-1}) = \tanh(W x_{t-1} + u_t)$.

The negative results presented in this paper are directly applicable to learning algorithms 
such as the EM algorithm (Dempster, Laird, & Rubin, 1977) or other gradient-based opti- 
mization algorithms, which rely on gradually and iteratively modifying continuous-valued 
parameters (such as transition probabilities, or parameters of a function computing these 
probabilities) in order to optimize a learning criterion. 

2. Mathematical Preliminaries 

A first-order Markovian model is defined by a discrete set of states $\{1, \ldots, n\}$, a probabilistic
transition function (state to next-state), and a probabilistic output function (state to
output). The discrete state variable $x_t$ can take values in $\{1, \ldots, n\}$ at each time step. We
will write $A_{ij}$ for the element $(i,j)$ of a matrix $A$, $A^n = AA \cdots A$ for the $n$th power of $A$,
and $(A^n)_{ij}$ for the element $(i,j)$ of $A^n$. See (Rabiner, 1989) for an introduction to HMMs,
and (Seneta, 1981) for a basic reference on positive matrices.

The Markovian independence assumption implies that the state variable $x_t$ summarizes
the past of the sequence: $P(x_t \mid x_1, x_2, \ldots, x_{t-1}) = P(x_t \mid x_{t-1})$. Another independence
assumption, when the state $x_t$ is hidden but an output $y_t$ is observed, is that the distribution
of $y_t$ at time $t$ does not depend on the other past variables when $x_t$ is given. State transitions
at time $t$ may depend on $u_t$ (the current input for IOHMMs or the current action
for POMDPs) and can be collected into an $n$ by $n$ transition matrix $A_t$ defined by

$A_{ij,t}(u_t) = P(x_t = j \mid x_{t-1} = i, u_t; \theta)$

where $\theta$ is a vector of adjustable parameters. In the homogeneous case, the transition
matrix is constant, i.e., $A_t = A$. The parameters are then usually directly identified with
the elements of the transition matrix $A$.

Output emissions $y_t$ depend on $u_t$ and the present state, as specified by the output
(also called emission) distribution $P(y_t \mid x_t, u_t; \vartheta)$, with parameters $\vartheta$. For example, if the
Markov chain is homogeneous and the output values belong to a finite alphabet of cardinality
$k$, then the parameters $\vartheta$ can be collected in a $k$ by $n$ matrix $B$, $B_{li} = P(y_t = l \mid x_t = i)$.






Bengio & Frasconi 



An output sequence $y_1^T$ can be generated according to the distribution $P(y_1^T \mid u_1^T)$
(non-homogeneous case) or $P(y_1^T)$ (homogeneous case) represented by the model, as follows.
First an initial state $x_0$ is selected according to a distribution $P(x_0)$ on initial states
(usually multinomial, sometimes requiring $n - 1$ extra parameters, or a fixed choice of a
single initial state). Then the state $x_t$ can be recursively picked as a function of the
previous state $x_{t-1}$, by choosing an $x_t \in \{1, \ldots, n\}$ according to the multinomial distribution
$P(x_t \mid x_{t-1}, u_t; \theta)$. At each time step, an output can then be generated according to the
distribution $P(y_t \mid x_t, u_t; \vartheta)$.
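As a minimal sketch of the generative procedure just described, restricted to the homogeneous case with a finite output alphabet, the following Python function draws a state path and an output sequence. The name `sample_hmm`, the argument layout (the emission matrix is stored state-by-symbol, i.e., as the transpose of the $k$ by $n$ matrix $B$ of the text) and the numerical values are illustrative assumptions, not part of the original model description.

```python
import random


def sample_hmm(init, A, B, T, rng):
    """Draw a state path and an output sequence from a homogeneous HMM.

    init[i] : P(x_0 = i), distribution over the initial state
    A[i][j] : P(x_t = j | x_{t-1} = i), row-stochastic transition matrix
    B[i][l] : P(y_t = l | x_t = i), stored state-by-symbol (assumed layout)
    """
    states = range(len(A))
    symbols = range(len(B[0]))
    x = rng.choices(states, weights=init)[0]          # initial state x_0
    path, outputs = [], []
    for _ in range(T):
        x = rng.choices(states, weights=A[x])[0]      # state transition
        y = rng.choices(symbols, weights=B[x])[0]     # emission from the new state
        path.append(x)
        outputs.append(y)
    return path, outputs


# Hypothetical two-state example.
example_rng = random.Random(0)
example_A = [[0.9, 0.1], [0.2, 0.8]]
example_B = [[0.7, 0.3], [0.1, 0.9]]
example_path, example_outputs = sample_hmm([0.5, 0.5], example_A, example_B, 10, example_rng)
```

Passing an explicit `random.Random` instance keeps the draw reproducible; with 0-1 transition probabilities the sampled path becomes deterministic, which is the limiting case discussed later in the paper.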

State transitions can be constrained by a directed graph Q, whose nodes are associated
with the states of the Markov chain. In particular, the probability $P(x_t = i \mid x_{t-1} = j)$ will
be constrained to be zero if there is no edge from node $j$ to node $i$.

2.1 Learning in Markovian Models 

The learning objective is often to maximize the output likelihood $P(y_1^T; \Theta)$, or the output
likelihood given the input $P(y_1^T \mid u_1^T; \Theta)$, where $\Theta$ comprises all the parameters of the
model. This can be accomplished with an EM algorithm when the form of the output
and transition probability models is simple enough, e.g., in the case of HMMs (Baum
et al., 1970; Levinson et al., 1983; Rabiner, 1989) or IOHMMs (Bengio & Frasconi, 1995b).
Alternatives, for maximizing the output likelihood or other criteria (such as the more dis- 
criminant mutual information between the output sequence and the correct model, Bahl 
et al. 1986), are usually based on some gradient-based optimization algorithm, requiring 
the computation of the gradient of the learning criterion with respect to the model parame- 
ters. In all of these cases, the learning algorithms perform products involving the transition 
probability matrices (Bengio & Frasconi, 1995a, 1995b), such as 

$\alpha_{i,t} = P(y_1^t, x_t = i \mid u_1^t) = P(y_t \mid x_t = i, u_t) \sum_l A_{li,t}(u_t)\, \alpha_{l,t-1}$
$\beta_{i,t} = P(y_{t+1}^T \mid x_t = i, u_{t+1}^T) = \sum_l A_{il,t+1}(u_{t+1})\, P(y_{t+1} \mid x_{t+1} = l, u_{t+1})\, \beta_{l,t+1}$    (1)

where the overall output likelihood is obtained from the final time step:

$P(y_1^T \mid u_1^T) = \sum_i \alpha_{i,T}$.

Note that if $L$ is the learning criterion and $\beta_{i,T} = \partial L / \partial \alpha_{i,T}$, then $\beta_{i,t} = \partial L / \partial \alpha_{i,t}$. In terms of
matrices, we can write

$\alpha_t = \Lambda_t A_t' \Lambda_{t-1} A_{t-1}' \cdots \Lambda_1 A_1' \alpha_0$
$\beta_t = A_{t+1} \Lambda_{t+1} \cdots A_T \Lambda_T \beta_T$    (2)

where $\alpha_t = [\alpha_{1,t} \ldots \alpha_{n,t}]'$, $\beta_t = [\beta_{1,t} \ldots \beta_{n,t}]'$ and $\Lambda_t$ is a diagonal matrix of emission
probabilities $P(y_t \mid x_t = i, u_t)$ (for the $i$th element). The matrix $A_t$ contains the transition
probabilities at time $t$, i.e., $(A_t)_{ij} = P(x_t = j \mid x_{t-1} = i, u_t; \theta)$. It can be easily verified that
the compact notation

$A^{(t_0,t)} = A_{t_0} A_{t_0+1} \cdots A_{t-1} A_t$    (3)

for products of matrices$^2$ can be used to describe the effect of the distribution of the state $x_{t_0}$
at time $t_0$ on the distribution of the state $x_t$ at time $t > t_0$: $A^{(t_0,t)}_{ij} = P(x_t = j \mid x_{t_0} = i, u_{t_0}^t; \theta)$.
Therefore, we will study how this product evolves under various conditions, when $t - t_0$
increases (for long-term dependencies). We will find under what (rather general) conditions
$A^{(t_0,t)}$ tends to become ill-conditioned, more precisely, when $x_t$ becomes more and more
independent of $x_{t_0}$ as $t - t_0$ increases. In Section 4.2, we also discuss equations (2) as $T - t$
increases. In the following subsection we first introduce some standard mathematical tools
for studying such products of non-negative matrices.

2. To verify equation (3), just apply recursively the simple decomposition rule of probabilities
$P(a) = \sum_b P(a \mid b) P(b)$.
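The forward recursion above can be sketched in a few lines of Python for the homogeneous case with discrete outputs. This is only an illustration under stated assumptions: the names `forward` and `likelihood`, the state-by-symbol layout of the emission matrix, and the example values are ours; the recursion is initialized with the distribution over the initial state, so that summing the final $\alpha$ vector gives the output likelihood.

```python
def forward(init, A, B, ys):
    """Forward recursion alpha_{i,t} = P(y_t | x_t = i) * sum_l A[l][i] * alpha_{l,t-1}.

    init[i] : P(x_0 = i)
    A[l][i] : P(x_t = i | x_{t-1} = l)
    B[i][y] : P(y_t = y | x_t = i)  (assumed state-by-symbol layout)
    ys      : observed symbols y_1, ..., y_T
    """
    n = len(A)
    alpha = list(init)  # alpha_0 is the distribution over the initial state
    for y in ys:
        alpha = [B[i][y] * sum(A[l][i] * alpha[l] for l in range(n))
                 for i in range(n)]
    return alpha


def likelihood(init, A, B, ys):
    """Output likelihood: sum over states of the final alpha vector."""
    return sum(forward(init, A, B, ys))


# Hypothetical two-state, two-symbol model.
demo_init = [0.6, 0.4]
demo_A = [[0.9, 0.1], [0.2, 0.8]]
demo_B = [[0.7, 0.3], [0.1, 0.9]]
demo_likelihood = likelihood(demo_init, demo_A, demo_B, [0, 0, 1])
```

Each step multiplies by the transposed transition matrix and then by a diagonal matrix of emission probabilities, which is exactly the structure whose long products are studied in the remainder of the paper.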

2.2 Definitions 

Definition 1 (Non-negative matrices) A matrix $A$ is said to be non-negative, written
$A \ge 0$, if $A_{ij} \ge 0$ $\forall i,j$.

Positive matrices are defined similarly (with strict inequalities, written $A > 0$).

By extension, we will also write $A \ge B$ when $\forall i,j$, $A_{ij} \ge B_{ij}$.

Definition 2 (Stochastic matrices) A non-negative square matrix $A \in \mathbb{R}^{n \times n}$ is called row
stochastic (or simply stochastic in this paper) if $\sum_{j=1}^{n} A_{ij} = 1$ $\forall i = 1, \ldots, n$.

Definition 3 (Allowable matrices) A non-negative matrix is said to be row [column] al- 
lowable if every row [column] sum is positive. An allowable matrix is both row and column 
allowable. 

A non-negative matrix can be associated to the directed transition graph Q that constrains
the Markov chain. The incidence matrix $\tilde{A}$ corresponding to a given non-negative matrix $A$
is the 0-1 matrix obtained by replacing all positive entries of $A$ by a 1. The incidence matrix
of $A$ is a connectivity matrix corresponding to the graph Q (assumed to be connected here).
Some algebraic properties of $A$ are described in terms of the topology of Q. Indices of the
matrix $A$ correspond to nodes of Q (we will also use "states of the model", talking about a
Markovian model).

Definition 4 (Irreducible Matrices) A non-negative $n \times n$ matrix $A$ is said to be irreducible
if for every pair $i,j$ of indices, $\exists\, m = m(i,j)$ positive integer s.t. $(A^m)_{ij} > 0$.

A matrix $A$ is irreducible if and only if the associated graph is strongly connected (i.e.,
there exists a path between any pair of states). A reducible matrix is one that is not
irreducible. If $\exists k$ s.t. $(A^k)_{ii} > 0$ (i.e., there is a path of length $k$ from node $i$ to itself),
$d(i)$ is called the period of index $i$ if $d(i)$ is the greatest common divisor (g.c.d.) of those $k$
for which $(A^k)_{ii} > 0$ (i.e., there are also paths of length $k$, $2k$, $3k$, etc., with $k = d(i)$). In
an irreducible matrix all the indices have the same period $d$, which is called the period of
the matrix. The period of a matrix is the g.c.d. of the lengths of all cycles in the associated
transition graph Q.

An example of a periodic matrix of period 3 is illustrated by the graph $Q_1$ of Figure 2.
All the paths starting from one of the states and returning to it are of length 3k for some 
positive integer k. 

Definition 5 (Primitive matrix) A non-negative matrix $A$ is said to be primitive if there
exists a positive integer $k$ s.t. $A^k > 0$.

Therefore, in a graph with a corresponding primitive matrix, one can always find a path
of length greater than some $k$ between any two nodes, and if there exists a path of length
$k$ between nodes $i$ and $j$, there are also paths of length $k + 1$, $k + 2$, etc. In the analysis
below, we will consider submatrices (and corresponding subgraphs) which are primitive. 
Note that an irreducible matrix is either periodic or primitive (i.e., of period 1), and that 
a primitive stochastic matrix is necessarily allowable. 
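The definitions of period and primitivity can be checked by brute force on small examples, working only with the incidence matrix. The sketch below is ours: the function names are hypothetical, the period is computed from closed walks of length at most $n^2$ (sufficient for an irreducible matrix), and the primitivity test relies on Wielandt's classical bound $k \le n^2 - 2n + 2$. The two example matrices are assumed reconstructions consistent with the textual description of Figure 2 (a two-loop graph of period 3, then the same graph with an extra loop of length 4 through states 4 and 5), since the figure itself is not available here.

```python
from math import gcd


def incidence(A):
    """0-1 matrix with a 1 wherever A has a positive entry."""
    return [[1 if a > 0 else 0 for a in row] for row in A]


def bool_mul(X, Y):
    """Boolean matrix product: entry (i, j) is 1 if some k links i to j."""
    n = len(X)
    return [[1 if any(X[i][k] and Y[k][j] for k in range(n)) else 0
             for j in range(n)] for i in range(n)]


def period(A, i=0):
    """g.c.d. of the lengths k <= n^2 with (A^k)_ii > 0 (0 if there are none)."""
    n = len(A)
    base = incidence(A)
    power, d = base, 0
    for k in range(1, n * n + 1):
        if power[i][i]:
            d = gcd(d, k)
        power = bool_mul(power, base)
    return d


def is_primitive(A):
    """True if A^k > 0 for some k between 1 and n^2 - 2n + 2 (Wielandt's bound)."""
    n = len(A)
    base = incidence(A)
    power = base
    for _ in range(n * n - 2 * n + 2):
        if all(all(row) for row in power):
            return True
        power = bool_mul(power, base)
    return False


# Assumed two-loop graph: 0 -> 1, 1 -> 2 or 3, 2 -> 0, 3 -> 0 (all cycles have length 3).
Q1 = [[0.0, 1.0, 0.0, 0.0],
      [0.0, 0.0, 0.5, 0.5],
      [1.0, 0.0, 0.0, 0.0],
      [1.0, 0.0, 0.0, 0.0]]

# Same graph with an extra loop 0 -> 1 -> 4 -> 5 -> 0 of length 4.
Q2 = [[0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 0.4, 0.3, 0.3, 0.0],
      [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
      [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
      [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
```

Since the cycle lengths 3 and 4 are relatively prime, the second matrix has period 1, which is why adding a single loop is enough to turn a periodic chain into a primitive one.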

2.3 The Perron-Frobenius Theorem

Right eigenvectors $v$ of a matrix $A$ and their corresponding eigenvalues $\lambda$ have the following
properties (see Bellman, 1974, for more on eigenvalues and eigenvectors):

$\det(A - \lambda I) = 0$,

where $I$ is the identity matrix, and

$Av = \lambda v$,

i.e.,

$\sum_j A_{ij} v_j = \lambda v_i$.

Note that for a stochastic matrix $A$ the largest eigenvalue has norm 1, which can be shown
as follows. Letting $i = \arg\max_j |v_j|$, we obtain

$|\lambda| \, |v_i| = |\sum_j A_{ij} v_j| \le \sum_j A_{ij} |v_j| \le |v_i| \sum_j A_{ij} = |v_i|$.

Hence all the eigenvalues have norm less than or equal to 1. Let us define the vector of ones
$\mathbf{1} = [1, 1, \ldots, 1]'$, where $v'$ denotes the transpose of $v$. Since $A\mathbf{1} = \mathbf{1}$ by definition of
stochastic matrices, 1 is an eigenvalue and $\mathbf{1}$ is its corresponding right eigenvector.

The following theorem will be useful in characterizing homogeneous products of stochastic
matrices (as in HMMs).

Theorem 1 (Perron-Frobenius Theorem) Suppose $A$ is an $n \times n$ non-negative primitive
matrix. Then there exists an eigenvalue $r$ such that:

1. $r$ is real and positive;

2. $r$ can be associated with strictly positive left and right eigenvectors;

3. $r > |\lambda|$ for any eigenvalue $\lambda \neq r$;

4. the eigenvectors associated with $r$ are unique to constant multiples;

5. if $0 \le B \le A$ and $\beta$ is an eigenvalue of $B$, then $|\beta| \le r$. Moreover, $|\beta| = r$ implies
$B = A$;

6. $r$ is a simple root of the characteristic equation $\det(A - rI) = 0$.

[See proof in the book by Seneta, 1981, Theorem 1.1.]

A direct consequence of the Perron-Frobenius theorem for stochastic matrices is therefore 
the following: 

Corollary 1 Suppose $A$ is a primitive stochastic matrix. Then its largest eigenvalue is 1
and there is only one corresponding right eigenvector $\mathbf{1} = [1, 1, \ldots, 1]'$. Furthermore, all
other eigenvalues are less than 1 in modulus.

Proof. $A\mathbf{1} = \mathbf{1}$ by definition of stochastic matrices. As shown above, all the eigenvalues
have a modulus less than or equal to 1. Thus, we deduce from the Perron-Frobenius Theorem that
1 is the largest eigenvalue, $\mathbf{1}$ is the unique associated eigenvector (up to a constant multiple),
and all other eigenvalues have modulus less than 1. $\Box$

In the next section we will discuss the consequences of this corollary for HMMs. As 
shown by Seneta (1981), we should also note that if A is stochastic but periodic with period 
d, then A has d eigenvalues of modulus 1 which are the d complex roots of 1. 
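As a small worked illustration of the periodic case (our own example, not taken from the original text), consider the deterministic cycle over three states, which is stochastic and periodic with period $d = 3$. Its characteristic equation gives exactly the three cube roots of 1, all of modulus 1:

```latex
C = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix},
\qquad
\det(C - \lambda I) = 1 - \lambda^3 = 0
\;\Longrightarrow\;
\lambda \in \left\{ 1,\; e^{2\pi i/3},\; e^{-2\pi i/3} \right\}.
```

No eigenvalue shrinks when $C$ is raised to a power, which is consistent with the absence of any loss of information in a purely deterministic cycle.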

3. Ergodicity 

In this section we analyze the case of a primitive transition matrix as well as the general case 
with a so-called canonical re-ordering of the matrix indices (defined below). We introduce 
ergodicity coefficients in order to measure the difficulty in learning long-term dependencies. 

3.1 Simplest Case: Homogeneous and Primitive 

A straightforward application of the Perron-Frobenius theorem and the associated
Corollary 1 is given in the following theorem.

Theorem 2 If $A$ is a primitive stochastic matrix, then as $t \to \infty$, $A^t \to \mathbf{1}v'$, where $v'$
is the unique stationary distribution of the Markov chain. The rate of approach is
geometric.

[See proof in the book by Seneta, 1981, Theorem 4.2.] 

The intuition behind the proof simply relies on the fact that when a matrix $A$ is taken
to a certain power $A^n$, its eigenvalues are taken to the same power. As we
have seen earlier, all the eigenvalues are less than or equal to one in modulus. Therefore, the
eigenvalues of $A$ whose modulus is less than 1 are associated with near-zero eigenvalues of $A^n$, as
$n \to \infty$. The only eigenvalues which do not converge to zero are those whose modulus is 1.
There is only one such eigenvalue in the case of a primitive stochastic matrix (associated with
the eigenvector $\mathbf{1}$). In the case of periodic matrices of period $d$, discussed below, there are
complex eigenvalues whose modulus is 1 and which are among the $d$th roots of unity.
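This intuition can be made explicit in the two-state case, where the powers of the transition matrix have a simple closed form. The following worked example is ours (the parameters $a$ and $b$ are hypothetical transition probabilities); it can be verified directly for $t = 0$ and $t = 1$ and then by induction:

```latex
A = \begin{pmatrix} 1-a & a \\ b & 1-b \end{pmatrix}, \quad 0 < a, b < 1,
\qquad
A^t = \frac{1}{a+b} \begin{pmatrix} b & a \\ b & a \end{pmatrix}
    + \frac{(1-a-b)^t}{a+b} \begin{pmatrix} a & -a \\ -b & b \end{pmatrix}.
```

The first term is the rank-1 matrix $\mathbf{1}v'$ with stationary distribution $v' = [b/(a+b),\; a/(a+b)]$, and the second term, driven by the eigenvalue $1 - a - b$ of modulus less than 1, vanishes geometrically. It stays constant in modulus only when $a, b \in \{0, 1\}$ with $a = b$, i.e., for deterministic transitions.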

We recall that the rank of a matrix $A$ is the dimension of the linear subspace spanned
by the rows (or columns) of $A$, i.e., the number of linearly independent rows (or
columns). Since the matrix obtained by the product $\mathbf{1}v'$ of two vectors has rank 1, we
obtain the following from Theorem 2. If $A$ is primitive, then $\lim_{t \to \infty} A^t$ is a
matrix whose eigenvalues are all 0 except for one eigenvalue $\lambda = 1$ (with corresponding
eigenvector $\mathbf{1}$), i.e., the rank of this product converges to 1, which means that its rows are
proportional. For a stochastic matrix, row proportionality is equivalent to row equality.
Since $(A^{t-t_0})_{ij} = P(x_t = j \mid x_{t_0} = i)$, it follows that the distribution over the states at time
$t > t_0$ becomes gradually independent of the distribution $P(x_{t_0})$ over the states at time $t_0$ as
$t - t_0$ increases. This is illustrated in Figure 6, which shows products of 1, 2, 3 and 4 random
primitive stochastic matrices, and rapid convergence to row equality, i.e., $P(x_t = j \mid x_{t_0} = i)$
does not depend any more on $i$ as $t - t_0$ becomes large. It means that, as one moves forward
in time, context information is diffused, and gradually lost. A consequence of Theorem 2
is therefore that it is very difficult to model long-term dependencies in sequential data
using a homogeneous HMM with a primitive transition matrix. After having introduced
ergodicity coefficients in the next sections, we will be able to discuss the more general case
of non-homogeneous models (such as IOHMMs and POMDPs), as well as comment on the
diffusion of context information in the forward and backward HMM equations (2).
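The convergence to row equality can be observed numerically on a small homogeneous chain. The sketch below is an illustration under our own assumptions: the 3-state matrix is a hypothetical primitive stochastic matrix, and `row_spread` (the largest gap between two entries of the same column) is simply a convenient measure of how far the rows are from being identical.

```python
def mat_mul(X, Y):
    """Plain matrix product of two lists-of-lists matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def mat_pow(A, t):
    """A raised to the non-negative integer power t."""
    n = len(A)
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        P = mat_mul(P, A)
    return P


def row_spread(P):
    """Largest difference between two entries of the same column of P."""
    return max(max(col) - min(col) for col in zip(*P))


# Hypothetical primitive stochastic matrix (its square is strictly positive).
A = [[0.8, 0.2, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.3, 0.7]]

# Spread between rows of A^t for increasing t: it shrinks geometrically.
spreads = [row_spread(mat_pow(A, t)) for t in (1, 5, 20, 80)]
```

Even with diagonal entries as large as 0.7 or 0.8, i.e., with a fairly "sticky" chain, the rows of $A^t$ become numerically indistinguishable after a few dozen steps.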

3.2 Coefficients of ergodicity 

To study products of non-negative matrices and the loss of information about initial state 
in Markov chains (particularly in the non-homogeneous case), we will define two coefficients 
of ergodicity. First, we introduce the projective distance between vectors $v$ and $w$:

$d(v', w') = \max_{i,j} \ln\left( \frac{v_i w_j}{v_j w_i} \right)$.

Note that some form of contraction takes place when $d(v'A, w'A) < d(v', w')$ (Seneta, 1981),
i.e., applying the linear operator $A$ to the vectors $v'$ and $w'$ brings them "closer" (according
to the above projective distance).

Definition 6 Birkhoff's contraction coefficient $\tau_B(A)$, for a non-negative column-allowable
matrix $A$, is defined in terms of the projective distance:

$\tau_B(A) = \sup_{v,w > 0;\ v \neq \lambda w} \frac{d(v'A, w'A)}{d(v', w')}$.

Dobrushin's coefficient $\tau_1(A)$, for a stochastic matrix $A$, is defined as follows:

$\tau_1(A) = \frac{1}{2} \max_{i,j} \sum_k |a_{ik} - a_{jk}|$.    (4)

Both $\tau_B$ and $\tau_1$ are called proper ergodicity coefficients, i.e., they have the properties that,
firstly, $0 \le \tau(A) \le 1$, and secondly, that $\tau(A) = 0$ if and only if $A$ has identical rows (and
therefore rank 1). The coefficients of ergodicity quantify the ergodicity of a matrix, i.e., at
what rate a power of the matrix converges to rank 1. Furthermore, $\tau(A_1 A_2) \le \tau(A_1)\tau(A_2)$
(Seneta, 1981). Therefore, as discussed in the next section, these coefficients can also be
applied to quantify how fast a product of matrices converges to rank 1.
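Both quantities are easy to compute for small matrices. The following sketch (function names and example matrices are our own assumptions) implements the projective distance and Dobrushin's coefficient of equation (4), and checks the submultiplicative property on a pair of hypothetical 2 by 2 stochastic matrices.

```python
from math import log


def projective_distance(v, w):
    """d(v', w') = max_{i,j} ln(v_i w_j / (v_j w_i)) for strictly positive vectors."""
    ratios = [log(vi / wi) for vi, wi in zip(v, w)]
    return max(ratios) - min(ratios)


def dobrushin(A):
    """tau_1(A) = 1/2 * max_{i,j} sum_k |a_ik - a_jk| for a stochastic matrix A."""
    n = len(A)
    return 0.5 * max(sum(abs(a - b) for a, b in zip(A[i], A[j]))
                     for i in range(n) for j in range(n))


def mat_mul(X, Y):
    """Plain matrix product of two lists-of-lists matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


# Hypothetical stochastic matrices.
A1 = [[0.9, 0.1], [0.3, 0.7]]
A2 = [[0.5, 0.5], [0.2, 0.8]]

tau_product = dobrushin(mat_mul(A1, A2))       # coefficient of the product
tau_bound = dobrushin(A1) * dobrushin(A2)      # product of the coefficients
```

The maximum over pairs $(i, j)$ in the projective distance reduces to the largest minus the smallest log-ratio $\ln(v_i / w_i)$, which is what the function computes; the distance is zero exactly when the two vectors are proportional.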

3.3 Products of Stochastic Matrices 

Let $A^{(1,t)}$ denote a forward product of stochastic matrices $A_1, A_2, \ldots, A_t$. From the properties
of $\tau_B$ and $\tau_1$, if $\tau(A_t) < 1$, $\forall t > 0$, then $\lim_{t \to \infty} \tau(A^{(1,t)}) = 0$, i.e., $\lim_{t \to \infty} A^{(1,t)}$ has rank 1
and identical rows. Weak ergodicity of a product of matrices is then defined in terms of a
proper ergodic coefficient $\tau$ (such as $\tau_B$ or $\tau_1$) converging to 0:

Definition 7 (Weak Ergodicity) The products of stochastic matrices $A^{(t_0,t)}$ are weakly
ergodic if and only if for all $t_0 \ge 0$, as $t \to \infty$, $\tau(A^{(t_0,t)}) \to 0$.

The following theorem relates weak ergodicity to rank loss in products of stochastic
matrices and, therefore, to the problem of learning and representing long-term context.

Theorem 3 Let $A^{(1,t)}$ be forward products of non-negative and allowable matrices; then
$A^{(1,t)}$ is weakly ergodic if and only if the following conditions both hold:

1. $\exists\, t_0$ s.t. $A^{(t_0,t)} > 0$ $\forall t \ge t_0$;

2. $A^{(t_0,t)}_{ik} / A^{(t_0,t)}_{jk} \to W_{ij}(t) > 0$ as $t \to \infty$, i.e., rows of $A^{(t_0,t)}$ tend to proportionality.

[See the proof in the book by Seneta (1981), Lemma 3.3 and 3.4.]

For stochastic matrices, row-proportionality (2nd condition above) is equivalent to row-equality
since rows sum to 1. Note that the limit $\lim_{t \to \infty} A^{(t_0,t)}$ itself does not need to exist
in order to have weak ergodicity. If such a limit exists and it is a matrix with all rows equal,
then the product is said to be strongly ergodic.
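For the non-homogeneous case, the same behaviour can be observed on a forward product of different stochastic matrices. The sketch below is only illustrative: the random matrices (with entries drawn from an arbitrary range before normalization), the seed, the dimension and the helper names are all our assumptions. It tracks Dobrushin's coefficient of the running product together with the bound obtained by multiplying the coefficients of the individual factors.

```python
import random


def random_stochastic(n, rng):
    """Random n x n row-stochastic matrix with strictly positive entries."""
    rows = []
    for _ in range(n):
        raw = [rng.uniform(0.1, 1.0) for _ in range(n)]
        total = sum(raw)
        rows.append([r / total for r in raw])
    return rows


def dobrushin(A):
    """tau_1(A) = 1/2 * max_{i,j} sum_k |a_ik - a_jk|."""
    n = len(A)
    return 0.5 * max(sum(abs(a - b) for a, b in zip(A[i], A[j]))
                     for i in range(n) for j in range(n))


def mat_mul(X, Y):
    """Plain matrix product of two lists-of-lists matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


rng = random.Random(0)
n = 4
P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
bound = 1.0
history = []  # entries (t, tau_1 of the product A_1 ... A_t, product of the tau_1(A_s))
for t in range(1, 9):
    A_t = random_stochastic(n, rng)
    P = mat_mul(P, A_t)
    bound *= dobrushin(A_t)
    history.append((t, dobrushin(P), bound))
```

Because every factor here is strictly positive, each individual coefficient is below 1 and the bound decreases at every step; the coefficient of the product can only be smaller, so the rows of the product approach equality even though no two factors are the same.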



3.4 Canonical Decomposition and Periodic Graphs 

Any non-negative matrix $A$ can be rewritten by relabeling its indices in the following canonical
decomposition (Seneta, 1981), with diagonal blocks $B_i$, $C_i$ and $Q$ (blank entries are zero):

$A = \begin{pmatrix}
B_1 &        &     &         &        &     &   \\
    & \ddots &     &         &        &     &   \\
    &        & B_s &         &        &     &   \\
    &        &     & C_{s+1} &        &     &   \\
    &        &     &         & \ddots &     &   \\
    &        &     &         &        & C_r &   \\
L_1 & \cdots & L_s & L_{s+1} & \cdots & L_r & Q
\end{pmatrix}$    (5)

where the $B_i$ and $C_i$ blocks are irreducible, the diagonal blocks $B_1, \ldots, B_s$ are primitive and
the diagonal blocks $C_{s+1}, \ldots, C_r$ are periodic. Define the corresponding sets of states as
$S_{B_i}$, $S_{C_i}$, $S_Q$. $Q$ might be reducible,
but the groups of states in $S_Q$ leak into the $B$ or $C$ blocks, i.e., $S_Q$ represents the transient
part of the state space. This decomposition is illustrated in Figure 1. We will consider
three cases: paths starting from a state in $S_Q$, $S_{B_i}$ or $S_{C_i}$. In the first case, for homogeneous
and non-homogeneous Markov models (with constant incidence matrix $\tilde{A}_t = \tilde{A}_0$), because
$P(x_t \in S_Q \mid x_{t-1} \in S_Q) < 1$, $\lim_{t \to \infty} P(x_t \in S_Q \mid x_0 \in S_Q) = 0$. In the second case, because
the $B_i$ are primitive, we can apply Theorem 1 to these sub-matrices, and starting from a
state in $S_{B_i}$, all information about an initial state at $t_0$ is gradually lost.



3.5 Periodic Graphs 

A more difficult case to analyze is the third case, i.e., that of paths from state $j$ at time $t_0$
to state $k$ at time $t$, with initial state $j \in S_{C_i}$ associated to a periodic block. Let $d_i$ be the
period of the $i$th periodic block $C_i$. It can be shown (Seneta, 1981) that taking the product
of $d$ periodic matrices with the same incidence matrix and period $d$ yields a block-diagonal
matrix whose $d$ blocks are primitive. Thus a product $C^{(t_0,t)}$ retains information about the
initial block in which $x_{t_0}$ was. However, for every such block of size $> 1$, information will
be gradually lost about the exact identity of the state within that block.

Figure 1: Transition graph corresponding to the canonical decomposition. Large dotted
circles represent subgroups of states associated to submatrices $B_i$, $C_i$, and $Q$
in equation (5). The large arrows on the upper right area generically represent
transitions from some states in $Q$ to some states in $B_i$ and $C_i$. Transitions among
states in each subgroup are depicted inside the large circles.

Figure 2: Periodic $Q_1$ becomes primitive (period 1) $Q_2$ when adding loop with states 4, 5.

This is best demonstrated through a simple example. Consider the incidence matrix
represented by the graph $Q_1$ of Figure 2. It has period 3 and the only non-deterministic
transition is from state 1, which can lead into either one of two loops. When many stochastic
matrices with this graph are multiplied together, information about the loop in which the 
initial state was is gradually lost (i.e., if the initial state was 2 or 3, this information is 
gradually lost). What is retained is the phase information, i.e., in which block ({0}, {1}, 
or {2,3}) of a cyclic chain was the initial state. This suggests that it will be easy to learn 
about the type of outputs associated to each block of a cyclic chain, but it will be hard 
to learn anything else. Suppose now that the sequences to be modeled are slightly more 
complicated, requiring an extra loop of period 4 instead of 3, as in Figure 2. In that case A 
is primitive: all information about the initial state will be gradually lost. 
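The periodic example can be reproduced numerically. Since Figure 2 is not available here, the sketch below assumes a 4-state graph consistent with the description above (state 0 leads to state 1, state 1 leads to state 2 or 3, and both return to state 0), with a time-varying probability `p` for the only non-deterministic transition; the function names and the values of `p` are hypothetical.

```python
def mat_mul(X, Y):
    """Plain matrix product of two lists-of-lists matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def two_loop_matrix(p):
    """Period-3 chain: 0 -> 1, 1 -> 2 (prob. p) or 3 (prob. 1 - p), 2 -> 0, 3 -> 0."""
    return [[0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, p, 1.0 - p],
            [1.0, 0.0, 0.0, 0.0],
            [1.0, 0.0, 0.0, 0.0]]


def forward_product(ps):
    """Product of the matrices two_loop_matrix(p) for p in ps, in order."""
    n = 4
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for p in ps:
        P = mat_mul(P, two_loop_matrix(p))
    return P


# Three steps (one full period) of a non-homogeneous chain on this graph.
P3 = forward_product([0.3, 0.6, 0.9])
```

After one full period the product is block-diagonal over the blocks {0}, {1} and {2, 3}: states 0 and 1 return to themselves with probability 1, so the phase is retained, while the rows of states 2 and 3 coincide, so the identity of the initial loop is already lost in this small example.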



4. Representing and Learning Long-Term Context 

Based on the analysis of the previous section, which applies to both the homogeneous and
non-homogeneous cases, we find in this section that in order to absolutely avoid all diffusion
of context and credit information (both learning and representing context), the transitions
should be deterministic (0 or 1 probability). For HMMs, this unfortunately corresponds to 
a system that can only model cycles (and is therefore not very useful for most applications). 
Both learning and representing context are hurt by the same ergodicity phenomenon because 
the state to next state transformation is linear, i.e., forward and backward propagation are 
symmetrical. 

We discuss the practical impact of this ergodicity problem for incremental learning 
algorithms (such as EM and gradient ascent in likelihood). 



4.1 Learning Long-Term Dependencies: a Discrete Problem? 

To better understand the problem, it is interesting to look at a particular instance of the
EM algorithm for HMMs, more specifically, at a form of the update rule for transition
probabilities,

$A_{ij} \leftarrow \frac{A_{ij} \frac{\partial L}{\partial A_{ij}}}{\sum_j A_{ij} \frac{\partial L}{\partial A_{ij}}}$    (6)

where $L$ is the likelihood of the training sequences. We might wonder if, starting from a
positive stochastic matrix, the learning algorithm could learn the topology, i.e., replace some
transition probabilities by zeroes. Starting from $A_{ij} > 0$ we could obtain a new $A_{ij} = 0$ only
if $\frac{\partial L}{\partial A_{ij}} = 0$, i.e., on a local maximum of the likelihood $L$. Thus the EM training algorithm
will not exactly obtain zero probabilities. Transition probabilities might however approach
0. Furthermore, once $A_{ij}$ has taken a near-zero value, it will tend to remain small. This
suggests that prior knowledge (or initial values of the parameters), rather than learning,
should be used, if possible, to determine the important elements of the topology, and for
establishing the long-term relations between elements of the observed sequences.
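The multiplicative form of update (6) makes these remarks easy to check on a toy example. In the sketch below, the function name `reestimate` is ours and the matrix `G` simply stands for given values of $\partial L / \partial A_{ij}$; these values are hypothetical and are not computed from any data.

```python
def reestimate(A, G):
    """Apply A_ij <- A_ij * G_ij / sum_j A_ij * G_ij row by row.

    A : current row-stochastic transition matrix
    G : assumed non-negative values standing for dL/dA_ij
    """
    new_A = []
    for a_row, g_row in zip(A, G):
        weights = [a * g for a, g in zip(a_row, g_row)]
        total = sum(weights)
        new_A.append([w / total for w in weights])
    return new_A


# Hypothetical values: the third transition of the first row starts at exactly zero,
# and the gradient is largest for that entry.
A = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5]]
G = [[2.0, 1.0, 5.0],
     [1.0, 1.0, 1.0]]
new_A = reestimate(A, G)
```

A transition probability that is exactly zero stays at zero whatever the gradient, and a strictly positive one remains strictly positive as long as its gradient is non-zero, so this update by itself cannot add or remove edges of the transition graph.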

It is also interesting to ask in which conditions we are guaranteed that there will not
be any diffusion (of influence in the forward phase, and credit in the backward phase of
training). It requires that all of the eigenvalues have a norm that is 1. This can be achieved
with periodic matrices $C$ (of period $d$), which have $d$ eigenvalues that are the $d$ roots of 1 on
the complex unit circle. To avoid any loss of information also requires that $C^d = I$ be the
identity, since any diagonal block of $C^d$ with size more than 1 will bring a loss of information
(because of ergodicity of primitive matrices). This can be generalized to reducible matrices
whose canonical form is composed of periodic blocks $C_i$ with $C_i^d = I$.

The condition we are describing actually corresponds to a matrix with only 1's and
0's. For this type of matrix, the incidence matrix $\tilde{A}_t$ of $A_t$ is equal to the matrix $A_t$ itself.
Therefore, when $\tilde{A}_t$ is fixed, the Markov chain is also homogeneous. It appears that many
interesting computations cannot be achieved with such constraints (i.e., only allowing one or
more cycles of the same period and a purely deterministic and homogeneous Markov chain).
Furthermore, if the parameters of the system are the transition probabilities themselves (as
in ordinary HMMs), such solutions correspond to a subset of the corners of the 0-1 hypercube
in parameter space. Away from those solutions, learning is mostly influenced by short term
dependencies, because of diffusion of credit. Furthermore, as seen in equation (6), algorithms
like EM will tend to stay near a corner once it is approached. This suggests that discrete
optimization algorithms, rather than continuous local algorithms, may be more appropriate to
explore the (legal) corners of this hypercube.

Examples of this approach are found in the area of grammar inference for natural
language modeling (e.g., variable memory length Markov models, Ron et al., 1994, or
constructive algorithms for learning context-free grammars, Lari & Young, 1990, Stolcke &
Omohundro, 1993). The problem of diffusion studied here applies only to algorithms that
use gradient information (such as the Baum-Welch and gradient-based algorithms) and a
gradual modification of transition probabilities. It would be interesting to evaluate how
such constructive and discrete search algorithms perform when properly solving the task
requires learning to represent long-term context. On the basis of the results of this paper,
however, we believe that in order to successfully learn long-term dependencies, such
algorithms should look for very sparse topologies (or very deterministic models). Note that some
of the already proposed approaches (Ron et al., 1994) are limited in the type of context
that can be represented (e.g., no loops in the graph and the constraint that all intermediate
observations between times $t_0$ and $t$ must be represented by the state variable in order to
model the influence of $y_{t_0}$ on $y_t$).

4.2 Diffusion of Credit 

We have already found above that except in the special case of 0 or 1 transition probabilities,
the state variable becomes more and more independent of remote past states (and therefore 
of remote past inputs and outputs). Since this prevents robustly representing long-term 
context, learning such a long-term context is also made more and more difficult for longer 
term dependencies. 
However, it is interesting to consider how the ergodicity of the transition probability
matrix directly affects the forward-backward equations (2) (to propagate context
information forward and backward) used in learning algorithms such as EM and (implicitly)
gradient descent. In particular, let us consider the Dobrushin ergodicity coefficient of these
matrix products. First, let $V_t = A_t \Lambda_t \cdots A_T \Lambda_T$, then

$$\tau_1(V_t) \le \tau_1(A_t)\,\tau_1(\Lambda_t) \cdots \tau_1(A_T)\,\tau_1(\Lambda_T) = \prod_{\tau=t}^{T} \tau_1(A_\tau)\,\tau_1(\Lambda_\tau) \qquad (7)$$

We have already seen that $\tau_1(A_t) < 1$ unless the transition probabilities are all 0 or 1.
Remember that the emission probability matrix $\Lambda_t$ is diagonal. Applying the definition of
$\tau_1$ (equation 4) to a diagonal matrix $D$, we obtain

$$\tau_1(D) = \frac{1}{2} \max_{i,j} \left( |D_{ii} - D_{ji}| + |D_{ij} - D_{jj}| \right) = \frac{1}{2} \max_{i,j} \left( |D_{ii}| + |D_{jj}| \right) \quad \text{with } i \neq j.$$

Therefore,

$$\tau_1(\Lambda_t) = \frac{1}{2} \max_{i,j} \left( P(y_t \mid x_t = i, u_t) + P(y_t \mid x_t = j, u_t) \right) \quad \text{with } i \neq j,$$

which is the average of the two largest emission probabilities at this time step. Therefore,
when the transition probabilities are not all 0 or 1, in the case of discrete outputs, $\tau_1(\Lambda_t) \le 1$,
and the ergodicity coefficient of the matrix product $V_t$ in equation (7) converges to 0 as
$T - t$ increases. Note that this product gives the gradient of $\alpha_T$ with respect to $\alpha_t$ (from
equation 1) and is used in the EM algorithm (Baum et al., 1970; Levinson et al., 1983) as
well as in gradient-based algorithms (Bridle, 1990; Bengio, De Mori, Flammia, & Kompe,
1992; Bengio & Frasconi, 1995b).

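For example, in the case of a learning criterion $L$,

$$\frac{\partial L}{\partial \alpha_t} = V_t \frac{\partial L}{\partial \alpha_T}$$

where $\frac{\partial L}{\partial \alpha_t}$ is the vector of partial derivatives of $L$ with respect to each component of $\alpha_t$. Since $V_t$ is used to propagate credit backwards,
its convergence to rank 1 means that long-term credit is gradually lost as it is propagated
backwards: the gradient of the learning criterion with respect to all the past states becomes
the same, i.e., $\frac{\partial L}{\partial \alpha_t}$ converges to a multiple of [1, 1, . . ., 1].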

The continuous emissions case is more difficult because the density $P(y_t \mid x_t = i, u_t)$
can locally be greater than one. The above result can still be obtained if we restrict our
attention to the cases in which the product of the largest emission probabilities at each time
step is bounded, which is the most likely case in practice. In the case where it is not bounded, we
conjecture that the same result can be obtained by considering scaled emission probability
matrices, with a scaling factor $s_t$ that is 1 when the emission probability is less than 1, and
that is $1 / \max_i P(y_t \mid x_t = i, u_t)$ otherwise. Letting $U_t = A_t s_t \Lambda_t \cdots A_T s_T \Lambda_T$, although the
overall gradient with respect to all the past states can grow very large (as $T - t$ increases),
the rank of $U_t$ still converges to 1, and the vector $\beta_t = \frac{\partial L}{\partial \alpha_t}$ also converges to a (possibly
very large) multiple of [1, 1, . . . , 1].
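The Dobrushin coefficient of a diagonal matrix, and the way the coefficient shrinks under products of row-stochastic matrices, can be checked on small examples. The following is a minimal sketch, assuming the usual definition of the coefficient as half the largest L1 distance between two rows (which agrees with the diagonal-matrix expression derived in this section). The function names and all numerical values are hypothetical, and the product check uses transition matrices only.

```python
def tau1(m):
    """Dobrushin coefficient: half the largest L1 distance between two rows."""
    rows = range(len(m))
    return 0.5 * max(
        sum(abs(a - b) for a, b in zip(m[i], m[j]))
        for i in rows for j in rows if i != j
    )


def diag(values):
    """Diagonal matrix with the given entries."""
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical emission probabilities of one observed symbol in 3 states.
emission = [0.7, 0.2, 0.1]
tau_diag = tau1(diag(emission))
avg_top_two = 0.5 * sum(sorted(emission, reverse=True)[:2])

# Two hypothetical row-stochastic transition matrices.
A1 = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]]
A2 = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]
tau_product = tau1(matmul(A1, A2))
tau_bound = tau1(A1) * tau1(A2)

print(tau_diag, avg_top_two)    # the two values agree
print(tau_product, tau_bound)   # the product coefficient does not exceed the bound
```

Each additional non-deterministic transition matrix multiplies the bound by a factor below 1, which is the mechanism by which credit diffuses as the time span grows.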
Figure 3: Typical problem with short-term dependencies hiding the long-term dependencies.

In practice we train HMMs with finite sequences. However, training will become more
and more numerically ill-conditioned as one considers longer term dependencies. Consider
as in Figure 3 two events $e_t$ (occurring at time $t$) and $e_\tau$ (occurring at time $\tau$ much earlier
than $t$), and suppose there are also "interesting" events occurring in between (i.e., events
which should influence the state variable at time $t$ in order to better model outputs at time
$t$ or later). Let us consider the overall influence of states at times $s < t$ upon the likelihood
of the outputs at time $t$. Because of the phenomenon of diffusion of credit, and because
gradients are added together, the influence of intervening events (especially those occurring
shortly before $t$) will be much stronger than the influence of $e_\tau$. Furthermore, this problem
gets geometrically worse as $t - \tau$ increases.

4.3 Sparse Matrices and Prior Knowledge

Clearly a positive matrix (corresponding to a fully-connected graph) is primitive. Thus in
order to learn long-term dependencies, we would like to have many zeros in the matrix of
transition probabilities (which reduces the problem of diffusion, as confirmed by the
experiments described in Section 5 and illustrated in Figure 5). Unfortunately, this generally
supposes prior knowledge of an appropriate connectivity graph. In practical applications of
HMMs, for example to speech recognition (Lee, 1989; Rabiner, 1989) or protein secondary
structure modeling (Chauvin & Baldi, 1995), prior knowledge is heavily used in setting
up the connectivity graph. As illustrated in Figure 4, in speech recognition systems the
meaning of individual states is usually fixed a-priori except within phoneme models. The
representation of long-term context is therefore not learned by the HMM. Transition
probabilities between groups of states representing a phoneme in a certain context are "learned"
from text or labeled speech data. However, in that case the "model" is a Markov model,
not a hidden Markov model: learning consists in counting co-occurrence of events such as
phonemes or words. The hard problem of learning a representation of context is therefore
avoided by choosing it on the basis of prior knowledge.

Figure 4: Learning of a representation of context in speech recognition HMMs is typically
limited to what happens within a phoneme. Higher-level representations are
chosen from prior knowledge and those parameters are often estimated from simple
co-occurrence statistics.

Another direction of research should be in ways to incorporate some prior knowledge 
with learning from examples, preferably in a way that simplifies the problem of learning 
(new) long-term dependencies. Our current research in this direction is based on the old 
AI idea of using a multi-scale representation. The state variable is decomposed into several 
"sub-state" variables (whose Cartesian product is equal to the "full" state variable), each 
operating at a different time scale. The a-priori assumption is that long-term context will 
be represented by "slow" state variables, which must be insensitive to the precise timing of 
events. This allows the propagation of context (and credit, for learning) over long durations 
through those higher-level state variables. To impose these multiple time scales, one can 
introduce constraints on the transition probabilities, such that the "slow" variables always 
have a small probability of changing at any time step. Another useful assumption is that the 
transition probabilities can be factored in terms of the conditional sub-state probabilities at 
each time scale, given the full state. We conjecture that the hypothesis behind this multi- 
scale structure is appropriate for most "natural" sequence learning tasks (such as those 
humans perform). 
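The effect of a "slow" sub-state variable can be illustrated with a very small factored model. The sketch below is an assumption-laden toy example, not the model studied by the authors: it uses two binary sub-state variables that evolve independently of each other (a special case of the factorization described above), with hypothetical flip probabilities 0.01 (slow) and 0.5 (fast), and all function names are illustrative.

```python
def matmul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matpow(a, power):
    """Compute a**power by repeated multiplication (power >= 0)."""
    n = len(a)
    out = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        out = matmul(out, a)
    return out


def flip_matrix(p):
    """Transition matrix of a binary variable that flips with probability p."""
    return [[1.0 - p, p], [p, 1.0 - p]]


def factored_transition(p_slow_flip, p_fast_flip):
    """Full-state transition matrix; full state index = 2 * slow + fast."""
    slow, fast = flip_matrix(p_slow_flip), flip_matrix(p_fast_flip)
    return [[slow[i // 2][j // 2] * fast[i % 2][j % 2] for j in range(4)]
            for i in range(4)]


def retention(matrix, steps):
    """Probability that each sub-state equals its initial value after `steps`,
    starting from full state 0 (slow = 0, fast = 0)."""
    row = matpow(matrix, steps)[0]
    slow_same = row[0] + row[1]   # full states with slow = 0
    fast_same = row[0] + row[2]   # full states with fast = 0
    return slow_same, fast_same


A = factored_transition(p_slow_flip=0.01, p_fast_flip=0.5)
slow_same, fast_same = retention(A, steps=50)
print(slow_same, fast_same)
```

In this toy setting the fast variable carries no information about its initial value after 50 steps (the probability is 0.5), while the slow variable still does, which is the kind of long-duration propagation the multi-scale constraint is meant to allow.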
5. Experiments

In this section we report some experimental results. Firstly, we study, from a numerical 
point of view, the convergence of products of stochastic matrices. Then we report an 
example of training in a problem in which the span of temporal dependencies is artificially 
controlled. 

5.1 Diffusion: Numerical Simulations 

In this experiment we measure how (and if) different kinds of products of stochastic
matrices converge, for example to a matrix with equal rows. We ran 4 simulations, each with
an 8-state non-homogeneous Markov chain but with different constraints on the transition
graph $\mathcal{G}$: 1) $\mathcal{G}$ is fully connected; 2) $\mathcal{G}$ is a left-to-right model (i.e., the incidence matrix $A$ is
upper triangular); 3) $\mathcal{G}$ is left-to-right but only one-state skips are allowed (i.e., $A$ is upper
bidiagonal); 4) the $A_t$ are periodic with period 4. Results shown in Figures 5 and 6 confirm the
convergence towards zero of the ergodicity coefficient at a rate that depends on the graph
topology. The exception is, as expected, the case of periodic matrices. Note how the sparser
graphs have a larger ergodicity coefficient, which should ease the learning of long-term
dependencies. In Figure 6, we visualize the convergence of products of fully connected matrices
towards equal rows in only 4 time steps. Each of the transition probability
matrices $A_t$ ($t = 1, 2, 3, 4$) was chosen randomly from a uniform distribution.
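A simulation in the spirit of this experiment can be sketched as follows. This is not the authors' code: the random seed, the number of steps (20), the lower bound 0.05 on the random weights, and the particular periodic structure (4 classes of 2 states visited cyclically) are assumptions made for illustration, and the coefficient is computed as half the largest L1 distance between two rows of the product.

```python
import random

N_STATES = 8


def tau1(m):
    """Dobrushin coefficient: half the largest L1 distance between two rows."""
    rows = range(len(m))
    return 0.5 * max(
        sum(abs(a - b) for a, b in zip(m[i], m[j]))
        for i in rows for j in rows if i != j
    )


def matmul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def random_stochastic(mask, rng):
    """Random row-stochastic matrix whose nonzero pattern follows `mask`."""
    rows = []
    for mask_row in mask:
        weights = [rng.uniform(0.05, 1.0) if allowed else 0.0
                   for allowed in mask_row]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


def make_masks(n):
    """Allowed transitions (i -> j) for the four graph constraints."""
    idx = range(n)
    return {
        "full": [[True for j in idx] for i in idx],
        "left_to_right": [[j >= i for j in idx] for i in idx],
        "bidiagonal": [[j == i or j == i + 1 for j in idx] for i in idx],
        "periodic": [[j % 4 == (i + 1) % 4 for j in idx] for i in idx],
    }


def ergodicity_curve(mask, steps, rng):
    """Coefficient of the running product A_1 A_2 ... A_t for t = 1..steps."""
    product = random_stochastic(mask, rng)
    curve = [tau1(product)]
    for _ in range(steps - 1):
        product = matmul(product, random_stochastic(mask, rng))
        curve.append(tau1(product))
    return curve


rng = random.Random(0)
curves = {name: ergodicity_curve(mask, 20, rng)
          for name, mask in make_masks(N_STATES).items()}
for name, curve in curves.items():
    print(name, ["%.2e" % curve[t] for t in (0, 4, 9, 19)])
```

A different matrix is drawn at every step, so the chain is non-homogeneous while its graph stays fixed. The exact numbers depend on the seed, but the periodic case keeps a coefficient of 1 because rows belonging to different classes always have disjoint supports.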

5.2 Training Experiments 

To evaluate how diffusion impairs training, a set of controlled experiments was performed,
in which the training sequences were generated by a simple homogeneous HMM with
long-term dependencies, depicted in Figure 7.

Two branches generate similar sequences except for the first and last symbol. The
extent of the long-term context is controlled by the self-transition probabilities of states 2
and 5, $\lambda = P(x_t = 2 \mid x_{t-1} = 2) = P(x_t = 5 \mid x_{t-1} = 5)$. Span or "half-life" is $\log(0.5) / \log(\lambda)$,
i.e., $\lambda^{\text{span}} = 0.5$. Following Bengio et al. (1994), data was generated for various spans of
long-term dependencies (0.1 to 1000).
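The relation between the span and the self-transition probability can be made concrete with a short sketch. The function names and the list of sampled spans are illustrative assumptions; only the formula and the range 0.1 to 1000 come from the text.

```python
import math


def span_from_self_transition(lam):
    """Half-life of a self-loop with probability lam (0 < lam < 1)."""
    return math.log(0.5) / math.log(lam)


def self_transition_from_span(span):
    """Self-transition probability lam such that lam ** span == 0.5."""
    return 0.5 ** (1.0 / span)


for span in (0.1, 1.0, 10.0, 100.0, 1000.0):
    lam = self_transition_from_span(span)
    print(span, lam, span_from_self_transition(lam))
```

A span of 1 corresponds to a self-transition probability of 0.5, and larger spans push the probability towards 1, so that states 2 and 5 are held for longer.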

For each series of experiments, varying the span, 20 different training trials were run
per span value, with 100 training sequences$^3$. Training was stopped either after a
maximum number of epochs (200), or after the likelihood did not improve significantly, i.e.,
$(l(t) - l(t-1)) / |l(t)| < 10^{-5}$, where $l(t)$ is the logarithm of the likelihood of the training
set at epoch $t$. A trial is considered successful (converged) when it yields a likelihood almost
as good or better than the likelihood of the generating HMM on the same data.
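The stopping rule can be written as a small predicate. This is a sketch under stated assumptions: the function name `should_stop`, its argument layout (one log-likelihood per completed epoch), and the handling of a log-likelihood of exactly zero are hypothetical choices that the text does not specify.

```python
def should_stop(log_likelihoods, max_epochs=200, tol=1e-5):
    """Return True when training should stop.

    `log_likelihoods` holds l(1), ..., l(t), one value per completed epoch.
    """
    epoch = len(log_likelihoods)
    if epoch >= max_epochs:
        return True            # maximum number of epochs reached
    if epoch < 2:
        return False           # no previous value to compare with
    current, previous = log_likelihoods[-1], log_likelihoods[-2]
    if current == 0.0:
        return True            # assumption: a perfect fit cannot improve
    # Relative improvement of the log-likelihood below the tolerance.
    return (current - previous) / abs(current) < tol


print(should_stop([-100.0, -90.0]))         # large improvement: keep training
print(should_stop([-100.0, -99.9999999]))   # negligible improvement: stop
```

Note that a long plateau of the training curve can trigger this criterion even though further progress might eventually be possible.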

If the HMM is fully connected (except for the final absorbing state) and has just the 
right number of states, trials almost never converge to a good solution (1 in 160 did). 
Increasing the number of states and randomly putting zeroes in the transition matrix helps 
convergence. This confirms common intuition, although using more states than strictly 
necessary may result in worse generalization to new examples and, hence, may not be an 
advisable solution to solve convergence problems. The randomly connected HMMs had 3
times more states than the generating HMM and random connections were created with
20% probability. Figure 8 shows the average number of converged trials for these different
types of HMM topologies. In all cases the number of successful trials rapidly drops to zero
beyond some value of span. In failed trials, the equivalent of states 3 and 6 of the generating
HMM are usually confused, i.e., these solutions don't take the beginning of the sequence
into account to represent the distribution of the symbols near the end of the sequence.
It is interesting to note that HMMs with many more states than necessary but sparse
connectivity performed much better. Typically, a sparser graph corresponds to a larger
coefficient of ergodicity (as exemplified in Figure 5), which allows long-term dependencies
to be represented and learned more easily.

3. This relatively small number of training sequences appeared sufficient since the likelihood of the
generating HMM did not improve much when trained on this data.

Figure 5: Convergence of Dobrushin's coefficient (see Definition 6) in product of stochastic
matrices associated to non-homogeneous Markov chains constrained by different
transition graphs. The flattening of the bottom curve is due to the limits of
numerical precision in the computer experiments.

Figure 6: Evolution of matrix products $A^{(1,t)}$ for a model having a fully connected transition
graph. Matrix elements (transition probabilities) are visualized with gray levels.

Figure 7: Generating HMM. Numbers in state circles denote state indices, numbers out of
state circles denote output symbols. This HMM was used to generate the training
data for the experiments summarized in Figure 8.

Figure 8: Percentage of convergence to a good solution (over 20 trials) for various series
of experiments as the span of dependencies is increased (by increasing the
self-transition probabilities of states 2 and 5). The task consists in modeling sequences
generated by the HMM depicted in Figure 7.

Another interesting observation is that in many cases, the training curve goes through 
one or more very flat plateaus. Such plateaus could be explained by the diffusion problem: 
the relative gradient with respect to some parameters is very small (thus the algorithm 
appears to be stuck). These plateaus can become a very serious problem when their slope 
approaches numerical precision or their length becomes unacceptable. 

6. Conclusion and Future Work 

In previous work on recurrent networks (Bengio et al., 1994) we had found that, for these 
nonlinear dynamical parameterized systems, propagating credit over the long term was in- 
compatible with storing information for the long term. Basically, with enough non-linearity 
(larger weights) to store long-term context robustly, gradients back-propagated through 
time vanish rapidly. In this paper, we have also found negative results concerning the rep- 
resentation and learning of long-term context, but they apply to Markovian models such as 
HMMs, IOHMMs or POMDPs. For these models, we found that both the representation 
and the learning of long-term context information are tied together. In general, they are 
both hurt by the ergodicity of the transition probability matrix (or submatrices of it). How- 
ever, when the transition probabilities are close to 1 and 0, information can be stored for the 
long term and credit can be propagated over the long term. Like our findings for recurrent 
networks, this suggests that the problem of learning long-term dependencies looks more like 
a discrete optimization problem. It appears difficult for local learning algorithms such as
EM or gradient descent to learn optimal transition probabilities near 1 or 0, i.e., to learn 
the topology, while taking into account long-term dependencies. This should encourage 
research on alternative (discrete) algorithms for discovering HMM topology (especially for 
representing long-term context), such as those proposed by Stolcke & Omohundro (1993) 
and Ron et al. (1994). Our results suggest that such algorithms should strive to discover 
sparse topologies, or almost deterministic models. The arguments presented here are essen- 
tially an application of established mathematical results on Markov chains to the problem 
of learning long-term dependencies in homogeneous and non-homogeneous HMMs. These
arguments were also supported by experiments on artificial data, studying the phenomenon 
of diffusion of credit and the corresponding difficulty in training HMMs to learn long-term 
dependencies. 

IOHMMs (Bengio & Frasconi, 1994, 1995b) and POMDPs (Sondik, 1973, 1978;
Chrisman, 1992) are non-homogeneous variants of HMMs, i.e., the transition probabilities are
functions of the input (for IOHMMs) or the action (for POMDPs) at each time $t$. The
results of this paper suggest that such non-homogeneous Markovian models could be better
suited (in some situations) to representing and learning long-term context. For such
models, forcing transition probabilities to be near 0 or 1 still allows the system to model some
interesting phenomena and perform useful computations. In practice, this means that the 
underlying dynamics of state evolution to be modeled should be deterministic. For example, 
a deterministic IOHMM can recognize strings from a deterministic grammar, taking into 
account long-term dependencies (Bengio & Frasconi, 1995b). For HMMs this constraint 
restricts the model to simple cycles, which are not very interesting. 

Our analysis and numerical experiments also suggest that using many more hidden states 
than necessary, with a sparse connectivity, reduces the diffusion problem. Another related 
issue to be investigated is whether techniques of symbolic prior knowledge injection (see, 
e.g., Frasconi, Gori, Maggini, & Soda, 1995) can be exploited to choose good topologies, or 
combine specific a-priori knowledge with learning from examples. 

Based on the analysis presented here, we are also exploring another approach to learning 
long-term dependencies that consists in building a hierarchical representation of the state. 
This can be achieved by introducing several sub-state variables whose Cartesian product 
corresponds to the system state. Each of these sub-state variables can operate at a different 
time scale, thus allowing credit to propagate over long temporal spans for some of these 
variables. 